Multiclass linear classifier:
How we can interpret it:
Limitations:
Use softmax function to convert scores to probabilities:
$$ s = f(x,W) \\ P(Y=k|X=x)=\frac{e^{s_k}}{\sum_j e^{s_j}} $$
Steps:
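The softmax conversion above can be sketched in numpy (the scores are made-up values for illustration):

```python
import numpy as np

def softmax(s):
    """Convert raw class scores s into probabilities P(Y=k|X=x)."""
    # Shift by the max score for numerical stability; this does not change the result.
    exp_s = np.exp(s - np.max(s))
    return exp_s / exp_s.sum()

scores = np.array([3.2, 5.1, -1.7])  # s = f(x, W) for three classes
probs = softmax(scores)              # positive values that sum to 1
```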
Example of multiclass SVM loss:
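A minimal sketch of the multiclass SVM (hinge) loss for one example, assuming the standard margin of 1; the scores are invented:

```python
import numpy as np

def multiclass_svm_loss(scores, correct_class, margin=1.0):
    """Sum, over incorrect classes, of how far each score is from being
    at least `margin` below the correct class's score."""
    margins = np.maximum(0.0, scores - scores[correct_class] + margin)
    margins[correct_class] = 0.0  # the correct class contributes no loss
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])    # made-up scores for three classes
loss = multiclass_svm_loss(scores, 0)  # class 0 is the true label
```

Only class 1 violates the margin here, contributing max(0, 5.1 - 3.2 + 1) = 2.9.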
Takeaways:
Example of how this is calculated with image class prediction:
Cross-entropy and MLE:
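Cross-entropy loss is the negative log-likelihood of the correct class under the softmax distribution; a small sketch (the scores are illustrative):

```python
import numpy as np

def cross_entropy_loss(scores, correct_class):
    """L_i = -log P(Y = y_i | X = x_i), with P given by softmax(scores)."""
    shifted = scores - scores.max()                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log softmax
    return -log_probs[correct_class]

scores = np.array([3.2, 5.1, -1.7])
loss = cross_entropy_loss(scores, 0)  # large when the correct class gets low probability
```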
Example using the image classification task:
Takeaways:
Regularization expresses a preference for simpler models over more complex ones. It is added as a penalty term to the loss function:
L1 regularization adds the L1 norm of the weight vector as a penalty: $$ L_i = |y_i-Wx_i|^2+\lambda\lVert W\rVert_1 $$
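A sketch of how the penalty term changes the loss; the function name, λ value, and toy data are my own, and squared error for the data term just mirrors the formula above:

```python
import numpy as np

def regularized_loss(W, X, y, lam=0.1, norm="l1"):
    """Squared-error data loss plus a regularization penalty on the weights.
    L1 (sum of |w|) tends to produce sparse weights; L2 (sum of w^2), small diffuse ones."""
    data_loss = np.sum((X @ W - y) ** 2)
    penalty = np.sum(np.abs(W)) if norm == "l1" else np.sum(W ** 2)
    return data_loss + lam * penalty

W = np.array([1.0, -2.0])
X = np.eye(2)                      # toy data where X @ W reproduces y exactly
y = np.array([1.0, -2.0])
loss = regularized_loss(W, X, y)   # data loss is 0, so only the penalty remains
```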
Given a scalar $s \in \mathbb{R}$ and vector $v \in \mathbb{R}^m$:
Q: Given two vectors, what's the size of $\frac{\partial v^{(1)}}{\partial v^{(2)}}$?
Q: Given a scalar and a matrix, what's the size of $\frac{\partial s}{\partial M}$?
Q: What is the size of $\frac{\partial L}{\partial W}$?
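The questions above follow a pattern: the gradient of a scalar has the shape of whatever you differentiate with respect to, and a vector-by-vector derivative is a Jacobian. A quick shape check in numpy (all arrays are arbitrary examples):

```python
import numpy as np

v = np.random.randn(4)       # vector in R^4
M = np.random.randn(3, 5)    # matrix in R^{3x5}

# Scalar w.r.t. vector: gradient has the vector's shape, (4,)
grad_s_v = 2 * v             # gradient of s = v . v

# Vector in R^3 w.r.t. vector in R^4: Jacobian of shape (3, 4)
A = np.random.randn(3, 4)
jacobian = A                 # Jacobian of the linear map u = A @ v

# Scalar w.r.t. matrix: gradient has the matrix's shape, (3, 5)
grad_s_M = 2 * M             # gradient of s = np.sum(M**2)
```

The same rule answers the last question: the gradient of the loss L with respect to W has the same shape as W itself.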
Takeaway:
What makes it different:
What is deep learning:
Features are engineered in traditional ML:
Features are automatically extracted in deep learning:
Example of features for image detection:
Data representations are built by composing simple functions into a complex network.
"End-to-end": Learning is applied to entire spectrum, from raw data -> feature extraction -> classification.
Algorithm:
Applying batch gradient descent:
Convergence notes:
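The batch update can be sketched as follows; the learning rate, step count, and synthetic data are arbitrary choices, assuming a squared loss on a linear model:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, steps=200):
    """Minimize L = sum_i (w . x_i - y_i)^2 using the gradient over the FULL batch."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        residual = X @ w - y                 # predictions minus targets, all examples
        grad = 2 * X.T @ residual / len(X)   # average dL/dw over the batch
        w -= lr * grad                       # step against the gradient
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
y = X @ np.array([2.0, -1.0])      # noiseless targets from true weights [2, -1]
w = batch_gradient_descent(X, y)   # should approach the true weights
```

Because the data are noiseless and the loss is convex, the iterates contract toward the true weights at a rate set by the learning rate and the curvature of the loss.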
How to compute $\frac{\partial L}{\partial W_i}$?
Derivation of update rule using squared loss:
Taking the partial derivative of this summation with respect to $w_j$ makes most of the terms vanish: when $i \neq j$, the term involving $w_i$ does not depend on $w_j$, so its derivative is zero.
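Written out, using the squared loss $L = (y - \sum_i w_i x_i)^2$ from the heading above:

$$ \frac{\partial}{\partial w_j} \sum_i w_i x_i = \sum_i x_i \frac{\partial w_i}{\partial w_j} = x_j, \qquad \text{since } \frac{\partial w_i}{\partial w_j} = 0 \text{ for } i \neq j $$

so by the chain rule

$$ \frac{\partial L}{\partial w_j} = -2\left(y - \sum_i w_i x_i\right)x_j. $$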
(More context on taking the partial derivative in the update rule above)
Update rule once we add a non-linearity (sigmoid) - it gets more complex:
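A sketch of that more complex gradient, with a numerical check; the squared loss and the example numbers are my own choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w, x, y):
    """Squared loss through a sigmoid: L = (y - sigmoid(w . x))^2.
    Chain rule: dL/dw_j = -2 (y - p) * p * (1 - p) * x_j, where p = sigmoid(w . x)."""
    p = sigmoid(w @ x)
    return (y - p) ** 2, -2 * (y - p) * p * (1 - p) * x

w, x, y = np.array([0.5, -0.3]), np.array([1.0, 2.0]), 1.0
_, grad = loss_and_grad(w, x, y)

# Central-difference check of each component of the analytic gradient
eps = 1e-6
num_grad = np.array([
    (loss_and_grad(w + eps * e, x, y)[0] - loss_and_grad(w - eps * e, x, y)[0]) / (2 * eps)
    for e in np.eye(2)
])
```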
Manual differentiation can get messy. We can decompose the complicated function into modular sub-blocks.
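One way to picture those sub-blocks: each module implements a forward pass and a backward pass that applies the chain rule locally (a toy sketch, not any particular framework's API):

```python
class Multiply:
    """A modular sub-block: knows its own forward pass and local gradient."""
    def forward(self, a, b):
        self.a, self.b = a, b
        return a * b
    def backward(self, grad_out):
        # Chain rule: route the upstream gradient through the local derivatives.
        return grad_out * self.b, grad_out * self.a

class Add:
    def forward(self, a, b):
        return a + b
    def backward(self, grad_out):
        return grad_out, grad_out

# Compose f(x, y, z) = (x * y) + z from the two blocks
mul, add = Multiply(), Add()
xy = mul.forward(2.0, 3.0)
out = add.forward(xy, 4.0)       # forward pass: 10.0
d_xy, d_z = add.backward(1.0)    # backward pass, seeded with d(out)/d(out) = 1
d_x, d_y = mul.backward(d_xy)    # d_x = 3.0, d_y = 2.0
```

Chaining `backward` calls through the blocks in reverse order recovers the full gradient without ever differentiating the composite function by hand.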
Key ideas:
Distributed representation: Toy example